Names: Leon Kwan, Doron lisiansky, Richard Clarke github repo: https://github.com/r-clarke/bioinformatics https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE119290
if (!requireNamespace("knitr", quietly = TRUE))
install.packages("knitr")
if(!("org.Hs.eg.db" %in% installed.packages()))
BiocManager::install('org.Hs.eg.db', update = FALSE)
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!("DESeq2" %in% installed.packages())) {
BiocManager::install("DESeq2", update = FALSE)
}
if (!("EnhancedVolcano" %in% installed.packages())) {
BiocManager::install("EnhancedVolcano", update = FALSE)
}
if (!("apeglm" %in% installed.packages())) {
BiocManager::install("apeglm", update = FALSE)
}
if (!("pheatmap" %in% installed.packages())) {
BiocManager::install("pheatmap", update = FALSE)
}
if (!("gprofiler2" %in% installed.packages())) {
BiocManager::install("gprofiler2", update = FALSE)
}
if (!("clusterProfiler" %in% installed.packages())) {
BiocManager::install("clusterProfiler", update = FALSE)
}
if (!("M3C" %in% installed.packages())) {
BiocManager::install("M3C", update = FALSE)
}
library(M3C)
library(umap)
Attaching package: ‘umap’
The following object is masked from ‘package:M3C’:
umap
library(Rtsne)
library(clusterProfiler)
clusterProfiler v4.0.5 For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
If you use clusterProfiler in published research, please cite:
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141. doi: 10.1016/j.xinn.2021.100141
Attaching package: ‘clusterProfiler’
The following object is masked from ‘package:stats’:
filter
library(gprofiler2)
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
library ("pheatmap")
library ("RColorBrewer")
library(magrittr)
library(matrixStats)
library(ggplot2)
library(DESeq2)
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:parallel’:
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply,
parCapply, parLapply, parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated,
eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce,
rownames, sapply, setdiff, sort, table, tapply, union, unique, unsplit, which.max, which.min
Attaching package: ‘S4Vectors’
The following object is masked from ‘package:clusterProfiler’:
rename
The following objects are masked from ‘package:base’:
expand.grid, I, unname
Loading required package: IRanges
Attaching package: ‘IRanges’
The following object is masked from ‘package:clusterProfiler’:
slice
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Attaching package: ‘MatrixGenerics’
The following objects are masked from ‘package:matrixStats’:
colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse, colCounts, colCummaxs, colCummins,
colCumprods, colCumsums, colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs, colMads,
colMaxs, colMeans2, colMedians, colMins, colOrderStats, colProds, colQuantiles, colRanges,
colRanks, colSdDiffs, colSds, colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
colWeightedMeans, colWeightedMedians, colWeightedSds, colWeightedVars, rowAlls, rowAnyNAs,
rowAnys, rowAvgsPerColSet, rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps, rowMadDiffs, rowMads, rowMaxs,
rowMeans2, rowMedians, rowMins, rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars, rowWeightedMads,
rowWeightedMeans, rowWeightedMedians, rowWeightedSds, rowWeightedVars
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor,
see 'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: ‘Biobase’
The following object is masked from ‘package:MatrixGenerics’:
rowMedians
The following objects are masked from ‘package:matrixStats’:
anyMissing, rowMedians
library(DOSE)
library('org.Hs.eg.db')
Loading required package: AnnotationDbi
Attaching package: ‘AnnotationDbi’
The following object is masked from ‘package:clusterProfiler’:
select
library(data.table)
data.table 1.14.2 using 1 threads (see ?getDTthreads). Latest news: r-datatable.com
**********
This installation of data.table has not detected OpenMP support. It should still work but in single-threaded mode.
This is a Mac. Please read https://mac.r-project.org/openmp/. Please engage with Apple and ask them for support. Check r-datatable.com for updates, and our Mac instructions here: https://github.com/Rdatatable/data.table/wiki/Installation. After several years of many reports of installation problems on Mac, it's time to gingerly point out that there have been no similar problems on Windows or Linux.
**********
Attaching package: ‘data.table’
The following object is masked from ‘package:SummarizedExperiment’:
shift
The following object is masked from ‘package:GenomicRanges’:
shift
The following object is masked from ‘package:IRanges’:
shift
The following objects are masked from ‘package:S4Vectors’:
first, second
###### Load the data into R.######
matrix_of_data <- as.matrix(read.table("GSE119290_Readhead_2018_RNAseq_gene_counts.txt"))
coldata<-read.csv("coldata.csv",header = T,row.names=1,stringsAsFactors=T)
##### calculate per-gene expression ranges and generating a density plot #####
matrix_of_ranges <- rowRanges(matrix_of_data, rows = NULL, cols = NULL, na.rm = FALSE, dim. = dim(matrix_of_data), useNames = NA)
vector_of_ranges <- 0
for(i in 1:26364) {
vector_of_ranges[i] <- matrix_of_ranges[i,2]-matrix_of_ranges[i,1]
}
vector_of_ranges
[1] 41 1767 132 132 132 132 5 5 5 5 3 3 1 1159 3270
[16] 1598 1598 22 22 22 6 6795 780 39 124 1111 28 102 2021 7450
[31] 888 32 15 1169 1133 98732 4 1028 41 1 1 1 14 77 6
[46] 10563 4508 43 3752 167 4599 8 1832 4976 51 1677 99 4426 40 1061
[61] 7033 5415 1625 5290 129 15 11 1244 367 3203 5804 274 5384 1102 3048
[76] 307 597 4254 10132 2897 7032 3740 49505 12 32 28 10 1705 3949 13540
[91] 706 209 6716 3721 53 2980 2161 28 117 13 2897 634 10 2 207
[106] 2181 1 534 969 3 923 833 525 3940 10 53 3687 1822 561 1697
[121] 25 2 0 139 0 3 1687 359 119 17325 23 188 8122 21 388
[136] 3736 7508 374 378 1 1179 3495 3052 7 1133 3149 3152 2788 3824 2
[151] 2872 3570 599 14 35 13253 21306 2 900 24204 20 157014 124 22 2
[166] 2 10 50 8 6421 500 231 910 4853 1140 1 72 21989 2971 1317
[181] 617 4 25 4932 39241 95 19906 902 897 210 2739 1447 878 15 14160
[196] 275 8176 5336 8058 176 35 1376 2751 2 222 1121 268 7510 1895 1698
[211] 357 2 1425 5122 1131 366 794 2488 10403 7352 2488 17 13 1 1447
[226] 1 3904 28 28 79 2 1 1 7 4 3 2 8 8 8
[241] 3 2 3 3 2 1 3 1 1 1 1 2 7 7 6
[256] 2 2 3 5 5 0 1 2 2 5 11017 4457 3006 799 1200
[271] 5 81 2735 9 3 40 532 1530 78 775 641 5069 185 3 6697
[286] 211 582 12264 9 3009 6 2 5 18 32 21303 539 565 2331 11170
[301] 9 2800 1331 1 0 7928 2279 69 2 2 5 68 3412 2901 3994
[316] 2805 183 12 9 0 5 3 18070 4617 567 1 26 740 1813 2
[331] 3029 4 5 4281 17509 8807 8857 95 151 707 2586 1738 13335 5025 2535
[346] 49 1127 3 84 0 2144 1 11 3 6 3 2 31 238 39
[361] 8 5195 2223 15 3 3255 72 3644 10838 1015 2591 15594 9933 5 17752
[376] 18 536 1613 1066 4797 2635 16293 1 1 40 570 14671 3622 0 3041
[391] 17 0 1 1 1 15623 1 518 5 6 12259 10 6 4128 129
[406] 331 23559 8387 1018 63 4543 2741 100 22553 3 29344 4179 935 3462 4903
[421] 1563 1042 504 4 0 7498 12001 4 6 9 4 58 342 815 1
[436] 90 26 8 8991 21268 16 0 0 1639 1339 59 4163 79 1386 724
[451] 5571 21117 2896 1106 843 1141 41159 2450 777 97 3 3 1144 65 1962
[466] 187 2 4 2725 7751 1651 3 86 4 4013 1490 41 38661 2325 2
[481] 18825 450 2091 64 1570 851 3 10816 14 1205 1767 1479 4063 2372 999
[496] 60 1435 24 6 332 8754 3491 2 916 451 3444 3776 9 155 3773
[511] 15 255 2095 25 6935 4965 4266 1116 6348 4278 8453 61 829 887 45
[526] 190 45 58 61 727 13 1005 5527 58 12478 1535 8136 1081 4099 2
[541] 2 17 13 38 0 58965 8804 48 48 8 4623 4205 2561 796 4389
[556] 1 4 59 8 4827 268 27561 9 1472 9841 13416 1426 9 7058 5237
[571] 1884 435 119 445 11258 212 29 4 6424 58752 133 214 2859 406 1225
[586] 1072 20138 4261 1299 7011 2394 1175 416 3532 1964 3419 223 890 4260 18
[601] 12015 657 21 336 795 2 3 40 0 22 8 42 19 2642 356
[616] 76 375 863 1679 34128 7909 5890 4032 92 13020 1311 6221 2357 10820 1650
[631] 531 1944 27 2058 11474 12352 206 3390 9894 2738 775 2917 3 1165 0
[646] 60 169 4 2908 28 1154 3216 3662 42 1609 5663 42 3752 1483 5248
[661] 1703 1228 5631 1294 2270 1357 6 17 1455 1736 1681 58 30 8027 17851
[676] 51909 10255 796 307 5994 1449 22 263 6 1516 3173 215 36 1138 1712
[691] 3398 26685 6738 2568 3 3967 887 2473 480 611 620 300 3409 666 2968
[706] 2 1 95 207 7894 491 138 2957 1590 38 6122 1 1 3461 878
[721] 13 705 167 2014 61381 228 2941 467 53 1262 1054 461 15 27625 87
[736] 127 6476 20 183 5 9 783 12 13014 4541 16 1813 3820 4 838
[751] 35040 5807 630 1535 12 4 115 2705 2215 5474 6074 413 24280 27 2039
[766] 4712 80 9712 7 872 42 8123 41389 59 42 40 58 16 1217 76
[781] 501 771 2002 2805 2923 1185 73 130 1854 1299 114 214 674 23224 5373
[796] 18910 93 4042 36 1326 1226 5819 8565 6 23 6757 363 2342 6000 8649
[811] 1486 251 2 1 13 133 3 687 363 2050 7 230 6918 3 0
[826] 2 26 2 2 1 17 783 3027 5010 20 6 29 13 128 7
[841] 246 2595 18 751 1 0 850 69 2554 2882 0 0 0 4508 35
[856] 1 3802 4418 6728 47 498 3875 711 15 4362 2074 1724 2482 4247 4964
[871] 311 18 1248 4877 206 160 3938 4 5 2 752 34 1137 2592 2660
[886] 634 14466 19 1 1 10 5272 1454 27 1644 2090 1 4392 843 1
[901] 7 629 4351 10736 275 60 22 2 1199 1246 486 4 8 38718 3
[916] 1 1576 7683 76 0 11118 1959 22 2 6 2343 5 291 187 2998
[931] 24724 6 25 7 1212 2 12 291 25 1 101 14 6835 7 3
[946] 607 1816 1 1 11 105 7893 8324 24 923 3 99 83 0 602
[961] 1349 750 82 3090 1334 9 1 8227 2 2002 3823 11 11 1 5001
[976] 1142 4180 851 8345 8 194 1 36 7 419 2152 1692 11 2 23
[991] 29496 3526 15995 355 4054 37225 7 2697 2902 38
[ reached getOption("max.print") -- omitted 25364 entries ]
plot(density(vector_of_ranges), log='x')
Warning in xy.coords(x, y, xlabel, ylabel, log) :
1 x value <= 0 omitted from logarithmic plot
The size of the expression matrix generated is 26,364 x 44. There are 26,364 different genes with 44 samples. Looking at the variation of counts, there are a few instances of very large ranges with majority of ranges being smaller. The density of gene variation follows an expected poisson distribution with lambda 1.There are some genes with high expression ranges, but most genes are not as variable.
##### generate a PCA plot #######
dds <- DESeqDataSetFromMatrix(countData = matrix_of_data,colData = coldata,design = ~ dex)
dds <- dds[rowSums(counts(dds)) > 1,]
vst <- vst(dds,blind = FALSE)
plotPCA(vst, intgroup=c("dex"))
##### generate a UMAP plot #######
matrix_of_data_t <- t(matrix_of_data)
df.umap = umap(matrix_of_data_t)
dimension1 <- range(df.umap$layout)
dimension2 <- dimension1
plot(dimension1, dimension2, type="n")
points(df.umap$layout[,1], df.umap$layout[,2], col=as.integer(coldata[,"dex"])+2, cex=2, pch=20)
labels.u = unique(coldata[,"dex"])
legend.text = as.character(labels.u)
legend.pos = "bottomleft"
legend.text = paste(as.character(labels.u), "")
legend(legend.pos, legend=legend.text, inset=0.03,col=as.integer(labels.u)+2,bty="n", pch=20, cex=1)
Neither UMAP nor PCA, show clear clustering. However, it should be noted that our dataset may not follow the assumptions necessary for dimensionality reduction for both PCA and UMAP. PCA assumes linearity in the dataset and UMAP assumes uniform distribution on a Reimann manifold.
#### Perform differential analysis on the samples from your two groups #####
deseq_object <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 308 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
deseq_results <- results(deseq_object)
deseq_results <- lfcShrink(deseq_object,coef = 2, res = deseq_results )
using 'apeglm' for LFC shrinkage. If used in published research, please cite:
Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
sequence count data: removing the noise and preserving large differences.
Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
head(deseq_results)
log2 fold change (MAP): dex SZ vs Control
Wald test p-value: dex SZ vs Control
DataFrame with 6 rows and 5 columns
baseMean log2FoldChange lfcSE pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric>
DDX11L1 15.2075 0.0185428 0.144337 0.785012 0.891035
WASH7P 1157.3822 0.1255759 0.110817 0.140908 0.348768
MIR6859-3 84.1270 0.1824412 0.131918 0.057109 0.200511
MIR6859-2 84.1270 0.1824412 0.131918 0.057109 0.200511
MIR6859-4 84.1270 0.1824412 0.131918 0.057109 0.200511
MIR6859-1 84.1270 0.1824412 0.131918 0.057109 0.200511
deseq_df <- deseq_results %>%
# make into data.frame
as.data.frame() %>%
# the gene names are row names -- let's make them a column for easy display
tibble::rownames_to_column("Gene") %>%
# add a column for significance threshold results
dplyr::mutate(threshold = padj < 0.05) %>%
# sort by statistic -- the highest values will be genes with
# higher expression
dplyr::arrange(dplyr::desc(log2FoldChange))
head(deseq_df)
plotCounts(dds, gene = "DDX11L1", intgroup = "dex")
We are performing differential expression analysis. In our DESeq2 object we designated our dex variable which represent the group as the model argument. Because of this, the DESeq function will use groups defined by dex to test for differential expression. We will use lfcShrink in order to decrease noise and preserve large differences between groups. We can see for example that The DDX11L1 gene samples have similar expression in both groups.
Volcano plot
#volcano plot here
volcano_plot <- EnhancedVolcano::EnhancedVolcano(
deseq_df,lab = deseq_df$Gene,x = "log2FoldChange",
y = "padj",pCutoff = 0.01 )
Warning: Ignoring unknown parameters: xlim, ylim
volcano_plot
Summary of volcano plot
in volcano plot we are presenting analysis between two groups, which presents the log fold-change vs. the log10 p-value.
Heatmap
#heatmap
vst <- vst(dds,blind = FALSE)
sampleDists <- dist(t(assay(vst)))
sampleDists
Sample_LI-01 Sample_LI-02 Sample_LI-03 Sample_LI-04 Sample_LI-05 Sample_LI-06 Sample_LI-07
Sample_LI-02 144.42095
Sample_LI-03 96.21431 133.97114
Sample_LI-04 99.87450 139.71166 37.77497
Sample_LI-05 83.60813 156.00739 101.78338 103.29480
Sample_LI-06 119.44816 116.22757 120.67800 123.34283 96.11964
Sample_LI-07 95.37821 155.36706 112.59969 113.70332 76.26127 115.94728
Sample_LI-08 95.26113 141.14019 109.07132 108.09854 78.42244 101.09101 39.36980
Sample_LI-09 80.44241 130.44555 99.27533 104.59342 58.80801 75.32779 84.06512
Sample_LI-10 84.61049 159.95587 101.84089 97.85090 50.30949 101.46811 77.90681
Sample_LI-11 89.07397 124.77483 80.79587 83.05840 91.87756 124.53992 90.01049
Sample_LI-12 42.10256 153.45828 95.05450 93.44808 80.13373 119.25132 91.88427
Sample_LI-13 93.57608 146.53943 89.62124 92.59267 71.68491 108.62476 78.17900
Sample_LI-14 98.06298 144.25071 93.04808 91.18505 75.18324 100.26206 80.47143
Sample_LI-15 88.27419 145.39448 96.89075 99.89449 72.05981 108.87130 68.31503
Sample_LI-16 91.47799 134.87317 95.56142 94.36748 76.91570 96.16823 75.87451
Sample_LI-17 110.21659 156.31258 108.56673 112.58090 93.57011 121.32108 79.21296
Sample_LI-18 115.32256 165.33256 111.34182 106.54065 96.07806 125.66399 82.89720
Sample_LI-19 117.31859 167.04564 123.81656 126.89681 116.25921 147.07707 92.33292
Sample_LI-20 117.22968 160.07730 119.69261 118.73346 114.31221 136.86663 91.68664
Sample_LI-23 115.82222 159.04777 117.81344 119.68953 106.69378 132.85037 86.20527
Sample_LI-24 117.43344 161.46974 117.82703 114.48324 106.97566 133.33938 86.94077
Sample_LI-25 114.69350 134.34106 78.37575 79.54452 105.88925 117.89541 115.89225
Sample_LI-27 107.69827 150.74338 98.20624 102.74555 101.89783 130.49242 101.04756
Sample_LI-08 Sample_LI-09 Sample_LI-10 Sample_LI-11 Sample_LI-12 Sample_LI-13 Sample_LI-14
Sample_LI-02
Sample_LI-03
Sample_LI-04
Sample_LI-05
Sample_LI-06
Sample_LI-07
Sample_LI-08
Sample_LI-09 78.70855
Sample_LI-10 75.79231 61.25290
Sample_LI-11 90.30398 96.56694 92.11518
Sample_LI-12 91.16393 84.22445 73.28430 86.55308
Sample_LI-13 80.65945 81.27663 78.69382 80.22686 89.71738
Sample_LI-14 76.53043 83.31679 75.45347 85.40616 90.37432 37.77026
Sample_LI-15 72.22288 79.79228 80.26742 82.77390 86.44219 39.10570 46.18250
Sample_LI-16 68.34821 79.38293 78.20575 85.62742 86.45278 48.33529 39.87597
Sample_LI-17 81.19500 97.19180 98.54813 102.53927 108.27463 81.33364 84.02504
Sample_LI-18 79.18897 107.02390 90.33651 103.42611 104.93807 87.07794 80.62000
Sample_LI-19 97.36230 118.90778 119.27992 114.69949 117.15625 106.04336 110.89265
Sample_LI-20 89.17978 115.67221 112.69894 112.69567 113.06653 104.30384 102.93066
Sample_LI-23 87.29910 108.25278 108.02152 108.10184 113.69641 98.01200 99.75891
Sample_LI-24 83.42641 112.45640 102.49730 107.78078 110.80109 100.49780 96.56208
Sample_LI-25 112.34720 106.37446 107.46017 83.09330 111.47755 81.20942 83.53271
Sample_LI-27 103.66152 101.12589 103.96036 95.66504 108.89038 90.49695 96.61505
Sample_LI-15 Sample_LI-16 Sample_LI-17 Sample_LI-18 Sample_LI-19 Sample_LI-20 Sample_LI-23
Sample_LI-02
Sample_LI-03
Sample_LI-04
Sample_LI-05
Sample_LI-06
Sample_LI-07
Sample_LI-08
Sample_LI-09
Sample_LI-10
Sample_LI-11
Sample_LI-12
Sample_LI-13
Sample_LI-14
Sample_LI-15
Sample_LI-16 40.29908
Sample_LI-17 81.23823 85.23606
Sample_LI-18 87.91855 82.64446 52.15115
Sample_LI-19 101.00510 106.81797 76.73758 90.05378
Sample_LI-20 99.72611 98.16905 76.69849 76.56972 43.88827
Sample_LI-23 93.77137 96.33758 61.38807 72.16698 58.01886 58.54225
Sample_LI-24 95.63683 92.77982 71.02220 61.30604 70.22445 56.41068 42.82230
Sample_LI-25 92.85541 90.58481 116.67682 116.47200 136.97377 131.50994 126.77162
Sample_LI-27 95.68051 100.95401 104.69435 111.40654 113.49061 114.35801 111.71844
Sample_LI-24 Sample_LI-25 Sample_LI-27 Sample_LI-28 Sample_LI-29 Sample_LI-30 Sample_LI-31
Sample_LI-02
Sample_LI-03
Sample_LI-04
Sample_LI-05
Sample_LI-06
Sample_LI-07
Sample_LI-08
Sample_LI-09
Sample_LI-10
Sample_LI-11
Sample_LI-12
Sample_LI-13
Sample_LI-14
Sample_LI-15
Sample_LI-16
Sample_LI-17
Sample_LI-18
Sample_LI-19
Sample_LI-20
Sample_LI-23
Sample_LI-24
Sample_LI-25 125.40196
Sample_LI-27 114.67548 101.83543
Sample_LI-32 Sample_LI-33 Sample_LI-34 Sample_LI-36 Sample_LI-37 Sample_LI-38 Sample_LI-39
Sample_LI-02
Sample_LI-03
Sample_LI-04
Sample_LI-05
Sample_LI-06
Sample_LI-07
Sample_LI-08
Sample_LI-09
Sample_LI-10
Sample_LI-11
Sample_LI-12
Sample_LI-13
Sample_LI-14
Sample_LI-15
Sample_LI-16
Sample_LI-17
Sample_LI-18
Sample_LI-19
Sample_LI-20
Sample_LI-23
Sample_LI-24
Sample_LI-25
Sample_LI-27
Sample_LI-40 Sample_LI-41 Sample_LI-42 Sample_LI-43 Sample_LI-44 Sample_LI-45 Sample_LI-46
Sample_LI-02
Sample_LI-03
Sample_LI-04
Sample_LI-05
Sample_LI-06
Sample_LI-07
Sample_LI-08
Sample_LI-09
Sample_LI-10
Sample_LI-11
Sample_LI-12
Sample_LI-13
Sample_LI-14
Sample_LI-15
Sample_LI-16
Sample_LI-17
Sample_LI-18
Sample_LI-19
Sample_LI-20
Sample_LI-23
Sample_LI-24
Sample_LI-25
Sample_LI-27
Sample_LI-47
Sample_LI-02
Sample_LI-03
Sample_LI-04
Sample_LI-05
Sample_LI-06
Sample_LI-07
Sample_LI-08
Sample_LI-09
Sample_LI-10
Sample_LI-11
Sample_LI-12
Sample_LI-13
Sample_LI-14
Sample_LI-15
Sample_LI-16
Sample_LI-17
Sample_LI-18
Sample_LI-19
Sample_LI-20
Sample_LI-23
Sample_LI-24
Sample_LI-25
Sample_LI-27
[ reached getOption("max.print") -- omitted 20 rows ]
sampleDistMatrix <- as.matrix( sampleDists )
rownames(sampleDistMatrix) <- paste( vst$dex, sep = " - " )
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
#Heatmap of sample-to-sample distances
pheatmap(sampleDistMatrix,clustering_distance_rows = sampleDists,
clustering_distance_cols = sampleDists,col = colors)
topVarGenes <- head(order(rowVars(assay(vst)), decreasing = TRUE), 20)
topVarGenes
[1] 23946 23945 24456 15776 24464 1660 10189 24311 1652 24481 6057 24463 22385 15849 24483 10106 10188
[18] 20848 10186 21054
mat <- assay(vst)[topVarGenes, ]
mat<-mat-rowMeans(mat)
anno <- as.data.frame(colData(vst)[c("dex")])
pheatmap(mat, annotation_col = anno)
Summary of heatmap plot we calculate the distance matrix and apply clustering , In the sample distance heatmap made previously, the dendrogram at the side shows us a hierarchical clustering of the samples. Such a clustering can also be performed for the genes. Since the clustering is only relevant for genes that actually carry a signal, one usually would only cluster a subset of the most highly variable genes. Here, for demonstration, we are selecting the 20 genes with the highest variance across samples. Treatment status and cell line information are shown with colored bars at the top of the heatmap. Blocks of genes that covary across patients.
Method/Ontology 1
#Method 1 code
topVarGenesGO <- head(order(rowVars(assay(vst)), decreasing = TRUE), 100)
topVarGenesGO
[1] 23946 23945 24456 15776 24464 1660 10189 24311 1652 24481 6057 24463 22385 15849 24483 10106 10188
[18] 20848 10186 21054 17462 16399 24462 22 24480 15850 24129 2580 20649 24457 20650 9553 24465 12523
[35] 4972 4259 10185 22806 10187 10184 15491 3454 20648 9739 21229 18715 17622 21773 3455 22294 7321
[52] 10181 10908 20645 21227 23012 20653 5159 4477 5001 3052 15389 23982 13827 21651 3453 20454 3720
[69] 4449 17605 19980 11498 17055 1707 17248 6386 6061 1509 20557 23515 23131 3711 3755 19040 22399
[86] 6996 5606 2536 18929 11051 6170 1511 10518 18091 15206 2875 20651 14865 7057 2894
mat_100 <- assay(vst)[topVarGenesGO, ]
gostres <- gost(query = rownames(mat_100),
organism = "hsapiens", ordered_query = FALSE,
multi_query = FALSE, significant = TRUE, exclude_iea = FALSE,
measure_underrepresentation = FALSE, evcodes = FALSE,
user_threshold = 0.05, correction_method = "g_SCS",
domain_scope = "annotated", custom_bg = NULL,
numeric_ns = "", sources = NULL, as_short_link = FALSE)
names(gostres)
[1] "result" "meta"
head(gostres$result, 6)
names(gostres$meta)
[1] "query_metadata" "result_metadata" "genes_metadata" "timestamp" "version"
plot1 <- gostplot(gostres, capped = TRUE, interactive = TRUE)
plot1
plot <- gostplot(gostres, capped = FALSE, interactive = FALSE)
plot
publish_gosttable(gostres, highlight_terms = gostres$result[c(1),],
use_colors = TRUE,
show_columns = c("source", "term_name", "term_size", "intersection_size"),
filename = NULL)
The input 'highlight_terms' is a data.frame. The column 'term_id' will be used.
summary of Method/Ontology 1 method :gprofiler2 Ontology : Gene Ontology In this enrichment analysis we Extracted the list of the top 100 differentially expressed genes and run gost function. The function enables to perform functional profiling of gene lists. The function performs statistical enrichment analysis to find over-representation of functions from Gene Ontology The x-axis represents the functional terms that are grouped and color-coded according to data sources and positioned according to the fixed “source_order”. The order is defined in a way that terms that are closer to each other in the source hierarchy are also next to each other in the Manhattan plot The y-axis shows the adjusted p-values in negative log10 scale. Every circle is one term and is sized according to the term size, i.e larger terms have larger circles.
Method/Ontology 2
#Method 2 code
#get differentially expressed genes
dds <- DESeqDataSetFromMatrix(countData = matrix_of_data,colData = coldata,design = ~ dex)
res <- results(DESeq(dds))
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 308 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
#store values
geneList = res[,2]
#create list of genes
names_list = rownames(res)
#lead gene names into symbols variable
symbols <- rownames(res)
#map symbols to IDs
gene_ids <- mapIds(org.Hs.eg.db, symbols, 'ENTREZID', 'SYMBOL')
'select()' returned 1:many mapping between keys and columns
#update ID's in list of gene names
for(i in 1:26364){
names_list[i] = gene_ids[names_list[i]]
}
names(geneList) = names_list
geneList = sort(geneList, decreasing = TRUE)
gene <- names(geneList)[abs(geneList) > 1.5]
x <- enrichDO(gene = gene,
ont = "DO",
pvalueCutoff = 0.05,
pAdjustMethod = "BH",
universe = names(geneList),
minGSSize = 5,
maxGSSize = 500,
qvalueCutoff = 0.05,
readable = TRUE)
output_table <- x@result
setDT(output_table)
output_table
y <- gseDO(geneList,
minGSSize = 120,
pvalueCutoff = 0.2,
pAdjustMethod = "BH",
verbose = FALSE)
Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
There are ties in the preranked stats (6.27% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
There are duplicate gene names, fgsea may produce unexpected results.
output_table <- y@result
setDT(output_table)
output_table
Summary of Method/Ontology 2 Method :clustProfiler Ontology : Disease Ontology
The results of the enrichment provide a table that did not contain any specific indicatation of schizophrenia. When ordering the table based on p-value, some related disorders to schizophrenia include dementia and bipolar disorder which are psychological disorders, however not schizophrenia specifically. There can be many reasons for this lack of findings, however the most possible might be that schizophrenia is not in the database used in the disease ontology for this enrichment.
##### clustProfiler Gene Ontology#####
ego3 <- gseGO(geneList = geneList,
OrgDb = org.Hs.eg.db,
ont = "CC",
minGSSize = 100,
maxGSSize = 500,
pvalueCutoff = 0.05,
verbose = FALSE)
Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
There are ties in the preranked stats (6.27% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
There are duplicate gene names, fgsea may produce unexpected results.
Warning in fgseaMultilevel(...) :
For some pathways, in reality P-values are less than 1e-10. You can set the `eps` argument to zero for better estimation.
output_table <- ego3@result
setDT(output_table)
output_table
As expected, the gene sets from the gene ontology are clearly related to schizophrenia, a neurodevelopmental disorder. The first few entries of the table all relate directly with significant differences with genes for the synapse, in integral part of neuron interaction for the to brain function. For example, a decrease in postsynaptic density, listed fourth on the table, dendritic spine density, also listed in the table, has been linked to SZ in postmortem brain tissue [1].
[1] Amber Berdenis van Berlekom, Cita H Muflihah, Gijsje J L J Snijders, Harold D MacGillavry, Jinte Middeldorp, Elly M Hol, René S Kahn, Lot D de Witte, Synapse Pathology in Schizophrenia: A Meta-analysis of Postsynaptic Elements in Postmortem Brain Studies, Schizophrenia Bulletin, Volume 46, Issue 2, March 2020, Pages 374–386, https://doi.org/10.1093/schbul/sbz060